1 Introduction
1.1 Project Goals
Obesity has emerged as one of the most pressing global health crises, with its prevalence nearly tripling worldwide since 1975, according to the World Health Organization (WHO). This alarming trend has fueled a dramatic rise in obesity-related diseases, including diabetes, cardiovascular conditions, and hypertension, imposing significant burdens on healthcare systems and economies. In Latin America and the Caribbean, the situation is particularly concerning: as of 2022, the Pan American Health Organization (PAHO) reported that nearly 25% of adults in the region are affected by obesity, emphasizing the urgent need for effective public health interventions. The crisis is especially acute in the countries central to this research. In 2018, Mexico recorded an adult obesity rate of 36.1%, while Peru and Colombia reported similarly worrisome rates of approximately 28% and 23%, respectively.
This widespread prevalence underscores the critical need for research focused on understanding and addressing the multifaceted factors contributing to obesity. In this context, the present study adopts an exploratory and primarily educational approach to examine the relationships between dietary habits, physical activity, and demographic variables, aiming to uncover their impact on obesity levels in Mexico, Peru, and Colombia. By leveraging a dataset consisting of 77% synthetically generated data (produced via the SMOTE algorithm) and 23% user-collected data from 498 participants, the research seeks to provide meaningful insights into this complex issue.
While the reliance on synthetic data and a non-representative sample limits direct real-world applicability, this study offers a unique opportunity to apply theoretical knowledge gained during the “Data Science in Business Analytics” course to a simulated scenario. By identifying patterns, correlations, and potential predictors of obesity, the research highlights the importance of data-driven approaches in addressing significant public health challenges. Ultimately, the findings aim to lay the groundwork for future studies and contribute to the development of informed public health strategies and healthcare policies, demonstrating the transformative potential of data analytics in managing and mitigating complex issues.
1.2 Research Questions
Question 1
What are the key lifestyle and behavioral factors that significantly contribute to obesity in Mexico, Peru, and Colombia?
Question 2
Can we predict whether a person will be obese based on some given combinations of factors?
Question 3
How can these insights be effectively leveraged to inform public health initiatives and combat the escalating health crisis?
2 Data
2.1 Sources
The dataset utilized in this project was obtained from the UCI Machine Learning Repository, a reputable and extensively used platform for data science and machine learning projects. Originally compiled by researchers at the Universidad de la Costa, Colombia, the dataset combines 77% synthetically generated data with 23% real-world data collected through a structured online survey. The synthetic data, created using the Synthetic Minority Over-sampling Technique (SMOTE) in Weka, addresses class imbalance, enhancing the dataset’s suitability for machine learning tasks. The real-world data, gathered from 498 participants over a 30-day period, captures detailed self-reported information on dietary habits, physical activity levels, and demographic characteristics. While synthetic data introduces uniformity and balance, it inherently lacks the complexity of real-world variability, and the user-collected data, though authentic, is susceptible to self-reporting biases and sampling limitations. These characteristics, along with the dataset’s diverse origins, make it an invaluable resource for simulating real-world challenges in healthcare analytics.
Code
library(here)
library(knitr)

# Read the raw data and preview the first six rows
dataset_raw <- read.csv(here("data/raw/dataset_raw.csv"))
kable(head(dataset_raw), format = "markdown", caption = "First 6 Rows of dataset_raw")
First 6 Rows of dataset_raw

| Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Female | 21 | 1.62 | 64.0 | yes | no | 2 | 3 | Sometimes | no | 2 | no | 0 | 1 | no | Public_Transportation | Normal_Weight |
| Female | 21 | 1.52 | 56.0 | yes | no | 3 | 3 | Sometimes | yes | 3 | yes | 3 | 0 | Sometimes | Public_Transportation | Normal_Weight |
| Male | 23 | 1.80 | 77.0 | yes | no | 2 | 3 | Sometimes | no | 2 | no | 2 | 1 | Frequently | Public_Transportation | Normal_Weight |
| Male | 27 | 1.80 | 87.0 | no | no | 3 | 3 | Sometimes | no | 2 | no | 2 | 0 | Frequently | Walking | Overweight_Level_I |
| Male | 22 | 1.78 | 89.8 | no | no | 2 | 1 | Sometimes | no | 2 | no | 0 | 0 | Sometimes | Public_Transportation | Overweight_Level_II |
| Male | 29 | 1.62 | 53.0 | no | yes | 2 | 3 | Sometimes | no | 2 | no | 0 | 0 | Sometimes | Automobile | Normal_Weight |
2.2 Description
The dataset consists of 2111 records and 17 attributes, offering a detailed examination of the factors contributing to obesity. The attributes represent a mix of categorical and continuous variables, providing insights into demographic, lifestyle, and behavioral factors.
The variables include:
Code
# Plain-language meaning of each column, in the same order as colnames(dataset_raw)
val_meaning <- c(
  "indicates the gender of the individual (Male/Female).",
  "represents the age of participants in years.",
  "the height of individuals in meters.",
  "the weight of participants in kilograms.",
  "indicates whether a family member has suffered from overweight (Yes/No).",
  "indicates if participants frequently consume high-caloric foods (Yes/No).",
  "scaled from 1 to 3, reflects how often vegetables are consumed (1 = Never, 3 = Always).",
  "indicates the typical number of main meals consumed daily.",
  "describes how often participants eat between meals (e.g., No, Sometimes, Frequently, Always).",
  "indicates whether participants smoke (Yes/No).",
  "scaled from 1 to 3, reflecting daily water intake (1 = Less than 1 liter, 3 = More than 2 liters).",
  "whether participants monitor their calorie intake (Yes/No).",
  "scaled from 0 to 4, indicating days of physical activity per week (0 = None, 4 = 4-5 days).",
  "reflects daily time spent on technological devices, in hours.",
  "indicates the frequency of alcohol consumption (e.g., I don't drink, Sometimes, Frequently, Always).",
  "describes the primary mode of transportation (e.g., Walking, Public Transportation, Automobile).",
  "the target variable, classifying obesity levels into categories such as Normal Weight, Overweight (Levels I and II), and Obesity (Types I, II, III)."
)

desc_table <- tibble::tibble(
  Name    = colnames(dataset_raw),       # Column names
  Type    = sapply(dataset_raw, class),  # Corresponding data types
  Meaning = val_meaning                  # Descriptions defined above
)

# One alignment character per column (three columns)
kable(desc_table, format = "markdown", caption = "Variable Descriptions", align = "lcl")
Variable Descriptions

| Name | Type | Meaning |
|---|---|---|
| Gender | character | indicates the gender of the individual (Male/Female). |
| Age | numeric | represents the age of participants in years. |
| Height | numeric | the height of individuals in meters. |
| Weight | numeric | the weight of participants in kilograms. |
| family_history_with_overweight | character | indicates whether a family member has suffered from overweight (Yes/No). |
| FAVC | character | indicates if participants frequently consume high-caloric foods (Yes/No). |
| FCVC | numeric | scaled from 1 to 3, reflects how often vegetables are consumed (1 = Never, 3 = Always). |
| NCP | numeric | indicates the typical number of main meals consumed daily. |
| CAEC | character | describes how often participants eat between meals (e.g., No, Sometimes, Frequently, Always). |
| SMOKE | character | indicates whether participants smoke (Yes/No). |
| CH2O | numeric | scaled from 1 to 3, reflecting daily water intake (1 = Less than 1 liter, 3 = More than 2 liters). |
| SCC | character | whether participants monitor their calorie intake (Yes/No). |
| FAF | numeric | scaled from 0 to 4, indicating days of physical activity per week (0 = None, 4 = 4-5 days). |
| TUE | numeric | reflects daily time spent on technological devices, in hours. |
| CALC | character | indicates the frequency of alcohol consumption (e.g., I don’t drink, Sometimes, Frequently, Always). |
| MTRANS | character | describes the primary mode of transportation (e.g., Walking, Public Transportation, Automobile). |
| NObeyesdad | character | the target variable, classifying obesity levels into categories such as Normal Weight, Overweight (Levels I and II), and Obesity (Types I, II, III). |
Gender (Categorical): indicates the gender of the individual (Male/Female).
Age (Continuous): represents the age of participants in years.
Height (Continuous): the height of individuals in meters.
Weight (Continuous): the weight of participants in kilograms.
Family History of Overweight (Categorical): indicates whether a family member has suffered from overweight (Yes/No).
Frequent Consumption of High-Caloric Food (FAVC) (Categorical): indicates if participants frequently consume high-caloric foods (Yes/No).
Frequency of Vegetable Consumption (FCVC) (Continuous): scaled from 1 to 3, reflects how often vegetables are consumed (1 = Never, 3 = Always).
Number of Main Meals per Day (NCP) (Continuous): indicates the typical number of main meals consumed daily.
Consumption of Food Between Meals (CAEC) (Categorical): describes how often participants eat between meals (e.g., No, Sometimes, Frequently, Always).
Physical Activity Frequency (FAF) (Continuous): scaled from 0 to 4, indicating days of physical activity per week (0 = None, 4 = 4-5 days).
Time Using Technology Devices (TUE) (Continuous): reflects daily time spent on technological devices, in hours.
Alcohol Consumption (CALC) (Categorical): indicates the frequency of alcohol consumption (e.g., I don’t drink, Sometimes, Frequently, Always).
Transportation Method (MTRANS) (Categorical): describes the primary mode of transportation (e.g., Walking, Public Transportation, Automobile).
Obesity Level (NObeyesdad) (Categorical): the target variable, classifying obesity levels into categories such as Normal Weight, Overweight (Levels I and II), and Obesity (Types I, II, III).
The dataset has been pre-processed, with normalization applied to continuous variables and categorical data encoded. SMOTE was used to address class imbalance, but care was taken to minimize artificial patterns. Despite the presence of synthetic data (77%), which ensures balance and diversity, and real-world data (23%), which introduces authenticity, the dataset’s combined structure allows for a comprehensive analysis of obesity-related factors while acknowledging potential biases like self-report inaccuracies.
2.3 Wrangling
Import dataset.
Code
library(here)
library(knitr)

# Read the raw data and preview the first six rows
dataset_raw <- read.csv(here("data/raw/dataset_raw.csv"))
kable(head(dataset_raw), format = "markdown", caption = "First 6 Rows of dataset_raw")
Load required libraries for data manipulation, visualization, and clustering. Each package serves a specific purpose:
dplyr: For data manipulation (e.g., filtering, summarizing).
tidyr: For data tidying (e.g., reshaping).
ggplot2: For visualization.
corrplot: For correlation matrix visualization.
ggridges: For creating ridge plots.
cluster: For clustering algorithms.
reshape2: For data reshaping, especially during visualization.
Check for missing values by counting the NA values in each column.
Code
# Count NA values per column
missing_values <- colSums(is.na(dataset))
kable(missing_values, format = "markdown", caption = "Missing Values in Each Column")
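Missing Values in Each Column

| Variable | Missing values |
|---|---|
| gender | 0 |
| age | 0 |
| height | 0 |
| weight | 0 |
| family_hist | 0 |
| caloric_food | 0 |
| vegetable_food | 0 |
| nb_meal_day | 0 |
| food_btw_meals | 0 |
| smoke | 0 |
| ch2o | 0 |
| calorie_check | 0 |
| physical_act | 0 |
| use_tech | 0 |
| freq_alcohol | 0 |
| m_trans | 0 |
| obesity_lev | 0 |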
All columns contain complete data, with no missing values. If missing data were present, we could address it either by removing the affected rows with dataset <- na.omit(dataset_raw) or by imputing the missing values with an appropriate measure (e.g., the mean or median).
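The two options mentioned above can be sketched as follows. This is a minimal illustration on a hypothetical toy data frame (`toy`, `impute_median`, and the values are assumptions for demonstration, not part of the project data), since the actual dataset contains no missing values.

```r
# Hypothetical toy data; names and values are illustrative only.
toy <- data.frame(
  age    = c(21, NA, 23, 27),
  weight = c(64, 56, NA, 87)
)

# Option 1: drop every row that contains at least one NA.
toy_complete <- na.omit(toy)

# Option 2: replace each NA in a numeric column with that column's median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
toy_imputed <- as.data.frame(lapply(toy, impute_median))

# Re-run the same NA count used in the report to confirm nothing is missing.
colSums(is.na(toy_imputed))
```

Row removal is the simpler choice but shrinks the sample, while median imputation keeps every row at the cost of slightly compressing the variable's spread.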
Check the structure of the dataset to identify data types for each variable. This helps in identifying columns that need to be converted or standardized.
Code
# Capture the structure of the dataset
str_output <- capture.output(str(dataset))

# Convert the structure output to a data frame
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)
kable(str_table, format = "markdown", caption = "Structure of the Dataset")
We convert specific columns to factors for categorical interpretation during analysis. Factors ensure proper handling of discrete variables in statistical modeling.
We arranged the levels of the obesity categories, food consumption between meals, and the frequency of alcohol use to follow a logical ordinal progression, ensuring these variables accurately reflect increasing severity or frequency for improved interpretability and analysis.
Using str() before and after confirms that each variable has the correct data type, preventing errors during analysis.
Code
# Capture the structure again after the factor conversion
str_output <- capture.output(str(dataset))
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)
kable(str_table, format = "markdown", caption = "Structure of the Dataset")
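The factor conversion and level ordering described above are not shown in the rendered code, so the following is only a sketch of how such a step might look. The toy data frame and the exact level labels are assumptions based on the values visible in the raw-data preview and the variable descriptions.

```r
# Hypothetical sample; the level labels are assumed from the variable descriptions.
toy <- data.frame(
  freq_alcohol = c("Sometimes", "no", "Frequently", "Always"),
  stringsAsFactors = FALSE
)

# Declare the levels explicitly so they run from least to most frequent.
toy$freq_alcohol <- factor(
  toy$freq_alcohol,
  levels  = c("no", "Sometimes", "Frequently", "Always"),
  ordered = TRUE
)

# str() now reports an ordered factor instead of a character column.
str(toy)
levels(toy$freq_alcohol)
```

Without the explicit `levels` argument, R would sort the labels alphabetically, which would place "Always" before "no" and break the intended ordinal progression.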
Check the number of rows after removing duplicates.
Code
nrow(dataset)
[1] 2087
Code
any(duplicated(dataset))
[1] FALSE
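The duplicate-removal step itself is not shown above. A minimal base-R sketch of one way to do it, on a hypothetical toy data frame (the names `toy`, `toy_unique`, and `n_removed` are illustrative assumptions), is:

```r
# Hypothetical toy data with two exact copies of the first row.
toy <- data.frame(
  age    = c(21, 21, 23, 21),
  weight = c(64, 64, 77, 64)
)

n_before <- nrow(toy)

# duplicated() flags every row that repeats an earlier row; keep the rest.
toy_unique <- toy[!duplicated(toy), ]

# Number of rows dropped, and a final check that no duplicates remain.
n_removed <- n_before - nrow(toy_unique)
any(duplicated(toy_unique))
```

Applied to this project, the same bookkeeping would be consistent with the 2111 records reported in the description and the 2087 rows remaining here, a difference of 24 rows.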
In-depth analysis of SMOTE’s impact and visualization of the class distribution
Code
ggplot(dataset, aes(x = obesity_lev)) +
  geom_bar(fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(
    title = "Class Distribution of Obesity Levels",
    x = "Obesity Level",
    y = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability
After applying SMOTE, the distribution is noticeably more balanced across all categories, with each class showing a similar count. This outcome reflects SMOTE’s intended effect of addressing class imbalance.
Distribution analysis
Density plot for age.
Code
ggplot(dataset, aes(x = age, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(
    title = "Age Distribution by Obesity Levels",
    x = "Age",
    y = "Density",
    fill = "Obesity Level"
  ) +
  xlim(14, 50)  # Limit the x-axis to ages 14–50
This graph allows us to assess the age distribution across obesity levels and to evaluate the impact of the SMOTE algorithm in generating synthetic data. Two key takeaways emerge: first, the distributions show a clear separation between obesity categories, particularly with younger ages dominating in lower obesity levels (e.g., Insufficient Weight and Normal Weight) and older ages appearing more prominently in higher obesity levels (e.g., Obesity Type II and III). Second, sharp peaks, such as the one around age 30 in “Obesity Type I,” could signal potential artifacts from data synthesis. While these patterns indicate that the dataset maintains logical trends, further validation is necessary to confirm that these separations and peaks reflect realistic population characteristics and not artificial biases introduced during data augmentation. Overall, the dataset appears well-structured, but these observations warrant careful consideration during analysis.
The summary statistics show relatively consistent means and standard deviations for Age, Height, and Weight across obesity levels, which suggests that SMOTE has preserved the overall distribution without introducing extreme values. Interpretation: Since the means and standard deviations are similar across classes, it appears SMOTE didn’t drastically alter the dataset’s variability. This consistency supports the idea that SMOTE effectively balanced the classes without distorting key variable distributions.
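The per-class summary statistics referred to above are not shown in the rendered output. One way such a table could be computed is sketched below with base R's `aggregate()`; the toy values and object names are hypothetical, and the project may have used a different approach (for example `dplyr`).

```r
# Hypothetical toy data; values are illustrative only.
toy <- data.frame(
  obesity_lev = c("Normal_Weight", "Normal_Weight", "Obesity_Type_I", "Obesity_Type_I"),
  age         = c(21, 23, 30, 32),
  weight      = c(64, 77, 98, 102)
)

# Mean and standard deviation of each numeric variable within each class.
group_means <- aggregate(cbind(age, weight) ~ obesity_lev, data = toy, FUN = mean)
group_sds   <- aggregate(cbind(age, weight) ~ obesity_lev, data = toy, FUN = sd)

group_means
group_sds
```

Comparing these per-class tables side by side is what allows a statement about whether variability looks similar across obesity levels.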
Perform K-means clustering and calculate silhouette score.
Silhouette Score from K-means Clustering: The mean silhouette score of approximately 0.456 suggests a moderate level of cohesion within clusters and some separation between them. This score indicates that the clusters (representing obesity levels) are neither too distinct nor too blended. Interpretation: A score close to 0.5 generally reflects reasonable class separability without excessive artificial separability. This score suggests that SMOTE has helped create distinguishable but not overly isolated clusters, which is desirable for class balance. We conclude that SMOTE has balanced the dataset without drastically distorting it.
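The clustering code is not shown above, so the following is only a sketch of how a mean silhouette score can be obtained with `kmeans()` and `cluster::silhouette()`. The simulated data, the choice of two clusters, and the scaling step are all assumptions for illustration; the report's own number of clusters and input variables are not stated.

```r
library(cluster)  # provides silhouette()

set.seed(42)

# Hypothetical two-feature data with two loose groups (20 points each).
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2)
)

# Standardize so that no single variable dominates the distance computation.
toy_scaled <- scale(toy)

# K-means with several random starts to reduce sensitivity to initialization.
km <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Silhouette width per observation, then the mean across all observations.
sil <- silhouette(km$cluster, dist(toy_scaled))
mean_sil <- mean(sil[, "sil_width"])
mean_sil
```

Silhouette widths lie between -1 and 1, so a mean near 0.456 sits in the middle of the positive range, which is what the interpretation above relies on.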
We verified the presence of any potential NA values that might have arisen during the conversion of categorical variables to numeric format.
Code
# Count NA values per column in the numeric version of the dataset
nb_na <- colSums(is.na(dataset_num))
kable(nb_na, format = "markdown", caption = "Presence of potential NA values in the dataset")
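Presence of potential NA values in the dataset

| Variable | NA count |
|---|---|
| gender | 0 |
| age | 0 |
| height | 0 |
| weight | 0 |
| family_hist | 0 |
| caloric_food | 0 |
| vegetable_food | 0 |
| nb_meal_day | 0 |
| food_btw_meals | 0 |
| smoke | 0 |
| ch2o | 0 |
| calorie_check | 0 |
| physical_act | 0 |
| use_tech | 0 |
| freq_alcohol | 0 |
| m_trans | 0 |
| obesity_lev | 0 |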
The results of the test confirmed that there are no NA values in the dataset, indicating that all variables were successfully converted to numeric format while retaining their integrity.
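The conversion that produced `dataset_num` is not shown, so the sketch below is an assumption about how factor columns might be turned into numeric codes. The toy data frame and the helper logic are hypothetical.

```r
# Hypothetical toy data with two factor columns and one numeric column.
toy <- data.frame(
  gender = factor(c("Female", "Male", "Male")),
  smoke  = factor(c("no", "yes", "no")),
  age    = c(21, 23, 27)
)

# Replace each factor by its integer level code; leave numeric columns untouched.
toy_num <- as.data.frame(lapply(toy, function(col) {
  if (is.factor(col)) as.numeric(col) else col
}))

# Same NA check as in the report, applied to the converted data.
colSums(is.na(toy_num))
```

The NA check matters because a different conversion route, such as calling `as.numeric()` on character labels like "yes", would silently produce NA values instead of codes.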
2.3.2 Listing Anomalies and Outliers
2.4 Correlation Analysis
To identify the factors most likely to influence obesity level, we computed a correlation matrix of the numeric variables, focusing on their associations with obesity_lev. Variables were reordered by the strength of their correlation with obesity_lev for clarity. A heatmap with a diverging color gradient visualizes these correlations: red indicates positive relationships, blue negative ones, and white weak or neutral ones. Numerical labels and rotated axis labels were added to improve interpretability and to highlight the key factors linked to obesity levels.
Code
# Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(
  dataset_num %>%
    select("physical_act", "freq_alcohol", "obesity_lev", "age", "weight", "height",
           "family_hist", "caloric_food", "vegetable_food", "food_btw_meals",
           "use_tech", "ch2o", "m_trans", "smoke", "nb_meal_day", "calorie_check",
           "gender"),
  use = "complete.obs"
)

# Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev", ]

# Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

# Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

# Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  # Center the rounded correlation value within each tile
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5, hjust = 0.5) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y = "Variables") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels for readability
    axis.text.y = element_text(angle = 45, vjust = 1)   # Rotate y-axis labels for readability
  )
Code
# Create the heatmap with correlation values for a reduced set of variables
# Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(
  dataset_num %>%
    select("physical_act", "freq_alcohol", "obesity_lev", "age", "weight",
           "family_hist", "caloric_food", "vegetable_food", "food_btw_meals",
           "use_tech", "ch2o", "height", "calorie_check", "gender"),
  use = "complete.obs"
)

# Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev", ]

# Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

# Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

# Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  # Center the rounded correlation value within each tile
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5, hjust = 0.5) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y = "Variables") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels for readability
    axis.text.y = element_text(angle = 45, vjust = 1)   # Rotate y-axis labels for readability
  )
3 Exploratory Data Analysis (EDA)
3.1 Descriptive statistics and distribution analysis
3.1.1 Age
Descriptive statistics for age.
Code
# Summary statistics plus the standard deviation, collected in one table
sum_age_df <- tibble::tibble(
  Metric = c(names(summary(dataset$age)), "Std. Dev"),
  Value  = c(summary(dataset$age), sd(dataset$age, na.rm = TRUE))
)
kable(sum_age_df, format = "markdown", caption = "Summary of the age variable")
Summary of the age variable

| Metric | Value |
|---|---|
| Min. | 14.000000 |
| 1st Qu. | 19.915937 |
| Median | 22.847618 |
| Mean | 24.353090 |
| 3rd Qu. | 26.000000 |
| Max. | 61.000000 |
| Std. Dev | 6.368801 |
Age distribution
The age data shows a right-skewed distribution, with a mean of 24.35 years that exceeds the median of 22.85 years. The range (14 to 61 years) covers a wide age span, but most individuals are concentrated in the 20–30 age range. The standard deviation (6.37 years) suggests moderate variability in the dataset. This young population distribution may limit the applicability of results to older age groups, where obesity risk factors could differ.
Age Distribution by Obesity Level (Violin Plot)
The age distribution varies across obesity levels, highlighting distinct trends. Insufficient and normal weight categories are concentrated among younger individuals (14–30), while overweight and obesity levels shift towards mid-adulthood (20–40), peaking around 30–35 years. Severe obesity (Type III) is rare in younger ages and more common in the 30–40 range. These patterns suggest the progression of weight issues with age and emphasize the need for targeted interventions during early to mid-adulthood to prevent worsening obesity levels.
Code
ggplot(dataset, aes(x = obesity_lev, y = age, fill = obesity_lev)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +  # Narrow box plot inside each violin
  labs(title = "Age Distribution by Obesity Level", x = "Obesity Level", y = "Age") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The violin plot shows, more clearly, how individuals in the lower obesity categories, such as insufficient and normal weight, are predominantly younger, with ages concentrated between 14 and 30 years. In contrast, higher obesity levels exhibit a broader age range, with a peak density observed around 30–40 years, particularly in Obesity Type I and Type II. Severe obesity (Type III) is rare in younger individuals and becomes more prominent in the mid-adulthood age group. This visualization underscores the gradual progression of obesity risk with age and emphasizes the critical need for early intervention strategies to address weight-related health issues, particularly during early and mid-adulthood when such risks become more pronounced.
Age distribution with a smoothed (LOESS) trend line for obesity level.
Code
# obesity_lev is converted to its numeric level code so a trend can be fitted
ggplot(dataset, aes(x = age, y = as.numeric(obesity_lev))) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "Trend of Obesity Level with Age", x = "Age", y = "Obesity Level") +
  theme_minimal()
The graph shows a smooth trend line capturing the overall pattern. Obesity levels increase significantly from adolescence to early adulthood, peaking around the 25–30 years age range. This period potentially represents a critical transition, where lifestyle factors such as reduced physical activity, higher caloric intake, and metabolic changes can contribute to the steep rise in obesity levels.
Beyond the peak, the trend shows a gradual decline in obesity levels after 30 years, which may reflect behavioral changes, such as increased health awareness, dietary improvements, or a selection bias in older age groups. This switch suggests that mid-20s to early-30s is a pivotal stage for interventions aimed at mitigating obesity risk.
3.1.2 Height
Descriptive statistics for height.
Code
# Summary statistics plus the standard deviation, collected in one table
sum_height_df <- tibble::tibble(
  Metric = c(names(summary(dataset$height)), "Std. Dev"),
  Value  = c(summary(dataset$height), sd(dataset$height, na.rm = TRUE))
)
kable(sum_height_df, format = "markdown", caption = "Summary of the height variable")
Summary of the height variable

| Metric | Value |
|---|---|
| Min. | 1.4500000 |
| 1st Qu. | 1.6301785 |
| Median | 1.7015840 |
| Mean | 1.7026741 |
| 3rd Qu. | 1.7694915 |
| Max. | 1.9800000 |
| Std. Dev | 0.0931859 |
Height distribution.
Code
ggplot(dataset, aes(x = height)) +
  geom_histogram(bins = 20, fill = "purple", color = "black", alpha = 0.7) +
  labs(title = "Height Distribution", x = "Height (m)", y = "Count") +
  theme_minimal()
The histogram shows that height (in meters) is approximately normally distributed, with a slight right skew. Values range from 1.45 m to 1.98 m, with a peak around 1.8 m, the most frequent height. The range is realistic, with no visible extreme outliers, and the standard deviation (0.09 m) indicates low variability. The mean and median are both about 1.70 m, consistent with a nearly symmetrical distribution.
Height by Obesity Level
Violin plot of height by obesity level.
Code
ggplot(dataset, aes(x = obesity_lev, y = height, fill = obesity_lev)) +
  geom_violin(alpha = 0.6) +
  labs(title = "Height Distribution by Obesity Level", x = "Obesity Level", y = "Height") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
The plot shows relatively low variability in height within each category, with overlapping ranges between most groups. Individuals with Insufficient Weight and Normal Weight have slightly narrower distributions, centered around similar heights (~1.7 m). As obesity levels increase (e.g., Obesity Type I–III), the distributions remain consistent, suggesting that height is not strongly associated with obesity classification and that weight may be more influential than height alone in determining obesity level.
3.1.3 Weight
Descriptive statistics for weight.
Code
# Summary statistics plus the standard deviation, collected in one table
sum_weight_df <- tibble::tibble(
  Metric = c(names(summary(dataset$weight)), "Std. Dev"),
  Value  = c(summary(dataset$weight), sd(dataset$weight, na.rm = TRUE))
)
kable(sum_weight_df, format = "markdown", caption = "Summary of the weight variable")
Summary of the weight variable

| Metric | Value |
|---|---|
| Min. | 39.00000 |
| 1st Qu. | 66.00000 |
| Median | 83.10110 |
| Mean | 86.85873 |
| 3rd Qu. | 108.01591 |
| Max. | 173.00000 |
| Std. Dev | 26.19085 |
Weight by gender
Density plot for weight distribution by gender.
Code
ggplot(dataset, aes(x = weight, fill = gender)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Weight by Gender", x = "Weight", y = "Density") +
  scale_fill_manual(values = c("pink", "lightblue"), name = "Gender", labels = c("Female", "Male")) +
  theme_minimal()
The density plot reveals distinct weight distributions between genders. Females generally weigh less, with a peak around 70 kg, while males peak around 85 kg and 115 kg, indicating a tendency toward higher weights. The overlapping region around 80–90 kg shows weights common to both genders, but the distinct density peaks emphasize gender-based differences in weight distribution. Overall, males dominate at the higher ranges. Weight ranges from 39 kg to 173 kg, with a mean of 86.9 kg, a median of 83.1 kg, and a standard deviation of 26.2 kg, indicating a moderate spread.
Weight by obesity level
Ridgeline Plot of Weight by Obesity Level.
Code
ggplot(dataset, aes(x = weight, y = obesity_lev, fill = obesity_lev)) +geom_density_ridges(scale =0.9, alpha =0.6) +labs(title ="Ridgeline Plot of Weight by Obesity Level", x ="Weight", y ="Obesity Level") +theme_minimal() +theme(legend.position ="none")
This ridgeline plot shows a clear progression in weight distribution across obesity levels. As the obesity level increases, the weight distribution shifts progressively to higher ranges. The “Normal Weight” and “Insufficient Weight” categories are concentrated at lower weights, while the higher obesity types (I, II, and III) peak at significantly greater weights, indicating a strong positive association between weight and obesity level. Overall, weight has a mean of 86.9 kg and a standard deviation of 26.2 kg.
3.1.4 Height and Weight
Scatter Plot (height vs weight), colored by obesity level.
Code
ggplot(dataset, aes(x = height, y = weight, color = obesity_lev)) +
  geom_point(alpha = 0.7) +
  # Add a linear trend line for each obesity level
  geom_smooth(method = "lm", se = FALSE, aes(group = obesity_lev)) +
  ggtitle("Scatter Plot of Weight vs Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level")
Facet Grid for Height and Weight by Obesity Level.
Code
ggplot(dataset, aes(x = height, y = weight)) +
  geom_point(alpha = 0.7, aes(color = obesity_lev)) +
  facet_wrap(~ obesity_lev) +  # One panel per obesity level
  ggtitle("Facet Grid of Weight and Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level") +
  theme(legend.position = "none")
The scatter plot with trend lines for each obesity level reveals a clear positive correlation between weight and height across all obesity levels. As the obesity level increases, the slope generally becomes steeper, indicating a stronger weight gain relative to height. We created the facet grid to show these trends more clearly. The “Obesity_Type_III” (yellow) category has the steepest slope, suggesting a significant weight increase per unit of height, which is consistent with the highest level of obesity.
Correlation between height and weight.
Code
# Pearson correlation between height and weight
correlation_height_weight <- cor(dataset$height, dataset$weight, use = "complete.obs")
correlation_height_weight
[1] 0.457468
The correlation observed between height and weight (r ≈ 0.457) aligns with existing literature, confirming the expected positive relationship between these variables.
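Because height and weight are positively correlated, neither one alone determines the obesity class. A common way to combine them is the body mass index, BMI = weight (kg) / height (m)². The sketch below applies it to two rows from the raw-data preview. Whether the dataset's labels were derived from BMI is an assumption here, not something stated in this report.

```r
# First and fourth rows of the raw-data preview (height in m, weight in kg).
height <- c(1.62, 1.80)
weight <- c(64.0, 87.0)
label  <- c("Normal_Weight", "Overweight_Level_I")

# Body mass index: weight divided by squared height.
bmi <- weight / height^2

data.frame(height, weight, bmi = round(bmi, 1), label)
```

These two rows give BMI values of roughly 24.4 and 26.9, which fall in the usual WHO ranges for normal weight (18.5 to 24.9) and overweight (25 to 29.9) and so agree with their recorded labels.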
3.1.5 Food between meals
Code
# Dodged bar chart for food_btw_meals by obesity level
ggplot(dataset, aes(x = food_btw_meals, fill = obesity_lev)) +
  geom_bar(position = "dodge", color = "black") +
  ggtitle("Dodged Bar Chart for Food Between Meals by Obesity Levels") +
  labs(x = "Food Between Meals", y = "Count", fill = "Obesity Levels") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14))
Code
# Stacked bar chart of food between meals by obesity level
# (proportions within each obesity level)
ggplot(dataset, aes(x = obesity_lev, fill = food_btw_meals)) +
  geom_bar(position = "fill") +  # Stack bars so each obesity level sums to 100%
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +  # Format y-axis as percentages
  ggtitle("Proportion of Food Between Meals Across Obesity Levels") +
  labs(x = "Obesity Levels", y = "Proportion (%)", fill = "Food Between Meals") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis text for readability
    plot.title = element_text(hjust = 0.5, size = 14)   # Center and style the title
  )
These charts provide a clear illustration of how the frequency of eating between meals varies across obesity levels. The most dominant behavior across all categories is “Sometimes,” which peaks in intermediate levels like Normal Weight and Overweight Level I, reflecting a common pattern of moderate snacking. However, as obesity levels increase to Obesity Types I–III, the responses for “Frequently” and “Always” diminish, while “Sometimes” becomes even more prevalent. This shift could indicate that higher obesity levels are more associated with habitual moderate snacking rather than excessive meal-snacking frequency. On the other hand, “No” responses remain negligible across all obesity levels, suggesting that eating between meals is almost universal in this population. This pattern underscores the importance of examining not just the frequency but also the quality and context of snacking as potential contributors to obesity progression.
3.0.0.1.6 High-caloric food consumption
Code
# Dodged Bar Chart for High-Caloric Food Consumption by Obesity Levels
ggplot(dataset, aes(x = caloric_food, fill = obesity_lev)) +
  geom_bar(position = "dodge", color = "black") +
  ggtitle("Dodged Bar Chart for High-Caloric Food Consumption by Obesity Levels") +
  labs(x = "High-Caloric Food Consumption", y = "Count", fill = "Obesity Levels") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14))  # Center and style the title
The dodged bar chart clearly shows that the majority of individuals, especially in the higher obesity categories (Obesity Type I–III), report consuming high-caloric foods (“yes”). This trend becomes increasingly pronounced as obesity levels rise, with very few individuals reporting “no” consumption in these categories. In contrast, lower obesity levels (e.g., Normal Weight, Overweight Level I) show a slightly higher representation of “no” responses, indicating a potential shift in dietary habits across obesity levels.
Code
# Grouped Bar Chart of High-Caloric Food by Obesity Level (Proportions within each Obesity Level)
ggplot(dataset, aes(x = obesity_lev, fill = caloric_food)) +
  geom_bar(
    position = "dodge",
    # Divide each bar's count by the total count of its obesity level
    aes(y = after_stat(count) / tapply(after_stat(count), after_stat(x), sum)[after_stat(x)]),
    color = "black"
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  ggtitle("Grouped Bar Chart of High-Caloric Food Consumption Across Obesity Levels") +
  labs(x = "Obesity Levels", y = "Proportion (%)", fill = "High-Caloric Food Consumption") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14)
  )
The grouped bar chart effectively shows the behavioral shift toward higher high-caloric food consumption as obesity levels increase. High-caloric food consumption (“yes”) consistently accounts for over 75% of responses, becoming nearly universal in higher obesity categories (Obesity Type I–III). In contrast, “no” responses are more visible in lower obesity levels, such as Insufficient Weight and Normal Weight, but remain a minority.
More precisely, a notable 88.4% of participants report frequent consumption of high-calorie foods, which may directly contribute to weight gain, highlighting the need for dietary interventions focused on reducing high-calorie intake.
3.0.0.1.7 Alcohol consumption
Frequency of alcohol consumption.
Code
# Filter out "Always" responses from the dataset
filtered_dataset <- dataset %>%
  filter(freq_alcohol != "Always")

# Dodged Bar Chart for freq_alcohol by Obesity Levels (excluding "Always")
ggplot(filtered_dataset, aes(x = freq_alcohol, fill = obesity_lev)) +
  geom_bar(position = "dodge", color = "black") +
  ggtitle("Dodged Bar Chart for Alcohol Consumption by Obesity Levels") +
  labs(x = "Alcohol Consumption Frequency", y = "Count", fill = "Obesity Levels") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14))  # Center and style the title
The chart shows that “Sometimes” is the dominant alcohol consumption frequency across all obesity levels, particularly in Normal Weight, Overweight Level I, and II categories. As obesity increases, “Frequently” becomes slightly more prominent, especially in Obesity Type III, while “No” responses decrease, being more common in lower obesity levels such as Insufficient and Normal Weight. The “Always” responses are excluded from this chart due to their near absence in the dataset, highlighting that excessive alcohol consumption is rare. This trend underlines the potential relationship between moderate-to-frequent alcohol consumption and higher obesity levels, emphasizing its importance for obesity-related behavioral research.
Code
# Prepare the data summary for 'Sometimes' and 'No' responses
data_summary <- dataset %>%
  filter(freq_alcohol %in% c("Sometimes", "No")) %>%
  group_by(obesity_lev, freq_alcohol) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(obesity_lev) %>%
  mutate(
    total = sum(count),
    proportion = count / total
  ) %>%
  ungroup()

# Visualization with updated title
ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = freq_alcohol, color = freq_alcohol)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +  # Format y-axis as percentages
  ggtitle("Proportion of 'Sometimes' and 'No' Alcohol Responses by Obesity Level") +
  labs(x = "Obesity Level", y = "Proportion (%)", color = "Alcohol Frequency") +
  scale_color_manual(values = c("No" = "purple", "Sometimes" = "gold")) +  # Improved color scheme
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14),  # Center and style title
    legend.position = "top"
  )
The proportion of individuals who drink alcohol “Sometimes” increases with higher obesity levels, peaking in Obesity_Type_III. In contrast, the share of respondents abstaining from alcohol (“No”) decreases as obesity levels rise. Because only these two responses are included, the two proportions sum to 100% within each obesity level. This pattern suggests that moderate alcohol consumption may be associated with higher obesity levels, while abstention is more common among those with lower obesity levels.
A possible interaction to investigate later is between alcohol frequency and caloric food preference, as both behaviors seem linked to higher obesity levels. Exploring this could reveal if individuals with a preference for caloric foods and moderate alcohol consumption have a compounding effect on obesity risk. This investigation could help clarify whether combined lifestyle factors contribute more significantly to higher obesity levels than each factor alone.
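One way such an interaction could be examined is sketched below. It is only an illustration under stated assumptions: the data frame `toy`, the binary `obese` flag, and all effect sizes are invented for the example, and only the column names `freq_alcohol` and `caloric_food` follow the report.

```r
# Hypothetical toy data: column names mirror the report, values are invented.
set.seed(42)
n <- 400
toy <- data.frame(
  freq_alcohol = factor(sample(c("No", "Sometimes"), n, replace = TRUE)),
  caloric_food = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Simulate a binary obesity flag whose log-odds depend on both habits
# and on their combination (the interaction of interest).
log_odds <- -1 +
  0.5 * (toy$freq_alcohol == "Sometimes") +
  0.8 * (toy$caloric_food == "yes") +
  0.7 * (toy$freq_alcohol == "Sometimes") * (toy$caloric_food == "yes")
toy$obese <- rbinom(n, size = 1, prob = plogis(log_odds))

# Share of obese individuals in each combination of the two habits.
obesity_share <- tapply(toy$obese, list(toy$freq_alcohol, toy$caloric_food), mean)
print(round(obesity_share, 2))

# Logistic model with an interaction term: `a * b` expands to a + b + a:b.
interaction_model <- glm(obese ~ freq_alcohol * caloric_food,
                         data = toy, family = binomial)
print(coef(summary(interaction_model)))
```

A clearly positive coefficient on the `freq_alcoholSometimes:caloric_foodyes` term would be consistent with a compounding effect, although with mostly synthetic data any such result should be read as exploratory.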
Daily calorie monitoring.
Code
# Dodged Bar Chart for calorie_check by Obesity Levels
ggplot(dataset, aes(x = calorie_check, fill = obesity_lev)) +
  geom_bar(position = "dodge", color = "black") +
  ggtitle("Dodged Bar Chart for Calorie Monitoring by Obesity Levels") +
  # The x-axis shows calorie_check, so it is labeled accordingly
  labs(x = "Calorie Monitoring", y = "Count", fill = "Obesity Levels") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14))  # Center and style the title
Code
data_summary <- dataset %>%
  group_by(obesity_lev, calorie_check) %>%
  summarise(count = n(), .groups = "drop") %>%
  # Note: after .groups = "drop", sum(count) is the total over the whole dataset,
  # so each proportion is a share of all respondents, not of one obesity level
  mutate(total = sum(count), proportion = count / total)

ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = calorie_check, color = calorie_check)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c("no" = "lightcoral", "yes" = "lightblue")) +
  labs(title = "Proportion of Calorie Checking by Obesity Level", x = "Obesity Level", y = "Proportion", color = "Calorie Check") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
The Dodged Bar Chart highlights two main trends regarding calorie-checking behavior across obesity levels: a significant increase in “Yes” responses as obesity levels rise, particularly from Overweight Level II onward, and a decrease in “No” responses, which are more prevalent in lower obesity levels like Normal Weight and Insufficient Weight. The second graph simplifies these trends by clearly illustrating the proportional shift between “Yes” and “No” responses, making the contrast between lower and higher obesity levels more visually apparent. Together, these visualizations emphasize a potential association between obesity severity and an increased tendency to check calorie intake, suggesting heightened dietary awareness in higher obesity categories.
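The summary behind the second graph divides each count by the total over the whole dataset. If the intent is the share of “yes” and “no” within each obesity level, the total has to be computed per level. The sketch below shows that calculation in base R on invented counts (the data frame `toy` and its numbers are hypothetical); in the dplyr pipeline the equivalent step would be a `group_by(obesity_lev)` before `mutate()`, as in the alcohol summary.

```r
# Hypothetical toy counts; only the column names follow the report.
toy <- data.frame(
  obesity_lev   = rep(c("Normal_Weight", "Obesity_Type_I"), each = 2),
  calorie_check = rep(c("no", "yes"), times = 2),
  count         = c(90, 10, 95, 5)
)

# ave() applies sum() within each obesity level, so the shares of one level add up to 1.
toy$total      <- ave(toy$count, toy$obesity_lev, FUN = sum)
toy$proportion <- toy$count / toy$total
print(toy)
```

With within-level shares, levels of different size become directly comparable, which is usually what a title such as "Proportion of Calorie Checking by Obesity Level" implies.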
3.0.0.1.8 Vegetable consumption
Code
ggplot(dataset, aes(x = vegetable_food)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", color = "black", alpha = 0.6) +
  geom_density(color = "darkgreen", linewidth = 1) +
  ggtitle("Histogram and Density of Vegetable Food Consumption") +
  theme_minimal() +
  labs(x = "Vegetable Food Consumption", y = "Density")
Code
ggplot(dataset, aes(x = weight, y = vegetable_food, color = obesity_lev)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(title = "Scatterplot of Weight vs Vegetable Food Consumption", x = "Weight", y = "Vegetable Food Consumption") +
  theme_minimal() +
  coord_cartesian(xlim = c(40, 135), ylim = c(2, 3))
The scatterplot provided with the trend line illustrates a distinct, non-linear relationship: vegetable consumption initially decreases as weight increases but then begins to rise again at higher weight levels.
This pattern suggests that individuals with lower weight, particularly those in the Insufficient Weight and Normal Weight categories, tend to report higher vegetable consumption. As weight progresses toward the Overweight categories, vegetable consumption decreases slightly, indicating a possible reduction in healthy dietary habits. However, at the upper end of the weight spectrum, corresponding to Obesity Type II and Obesity Type III, vegetable consumption increases again, potentially due to dietary interventions or awareness in this group.
The trend reveals two possible key insights:
A dip in vegetable consumption occurs in intermediate weight ranges, aligning with the overweight population.
The sharp increase in vegetable consumption among the most obese individuals may reflect lifestyle adjustments prompted by health concerns or medical advice.
3.0.0.1.9 Physical activity
Plot histogram and density.
Code
ggplot(dataset, aes(x = physical_act)) +
  # after_stat(density) and linewidth replace the `..density..` notation and the
  # `size` argument for lines, both deprecated since ggplot2 3.4.0
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Physical Activity") +
  theme_minimal() +
  labs(x = "Physical Activity", y = "Density")
The histogram and density plot reveal that physical activity levels have distinct peaks at 0, 1, 2, and 3, suggesting that these are the commonly reported levels. Intermediate values are also present but less frequent. They most likely come from the synthetic records generated with SMOTE, which interpolates between existing observations.
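The origin of these intermediate values can be illustrated with the core interpolation step of SMOTE. This is a simplified one-variable sketch, not the procedure used to build the dataset: the two observations `x_i` and `x_j` are hypothetical, and the nearest-neighbour search of the real algorithm is skipped.

```r
set.seed(1)

# Two hypothetical respondents with integer-coded physical activity levels.
x_i <- 1  # an observed sample
x_j <- 2  # one of its nearest neighbours from the same class

# SMOTE-style synthetic points: random positions on the segment between the two.
u <- runif(5)                        # interpolation weights in [0, 1]
synthetic <- x_i + u * (x_j - x_i)
print(round(synthetic, 3))
```

Every synthetic value falls between the two original levels, which is why a variable recorded on an integer scale shows non-integer values after oversampling.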
Violin plot by category.
Code
ggplot(dataset, aes(x = obesity_lev, y = physical_act, fill = obesity_lev)) +  # Replace 'obesity_lev' with any category variable
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  ggtitle("Violin Plot of Physical Activity by Obesity Level") +
  theme_minimal() +
  labs(x = "Obesity Level", y = "Physical Activity") +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Physical activity levels show a slight decline as obesity levels increase, particularly evident in the narrowing distributions and lower medians observed for Obesity Type II and Obesity Type III categories. In contrast, the Insufficient Weight and Normal Weight groups exhibit higher physical activity levels, as reflected by their broader and more symmetrical distributions.
The graph reveals a distinct trend: individuals in lower obesity categories engage in more physical activity compared to those in higher obesity categories. This trend suggests an inverse relationship between physical activity and obesity levels.
3.0.0.1.10 Water consumption
Plot histogram and density for water consumption.
Code
ggplot(dataset, aes(x = ch2o)) +
  # after_stat(density) and linewidth avoid the ggplot2 3.4.0 deprecation warnings
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Consumption of Water") +
  theme_minimal() +
  labs(x = "CH2O", y = "Density")
This histogram and density plot of daily water consumption (CH2O) shows a clear peak at 2 liters, indicating that most individuals consume around this amount. This aligns with scientific literature, which generally recommends an average daily water intake of about 2 liters for optimal health.
Scatterplot of weight vs. water consumption.
Code
# Scatterplot with a LOESS trend line
ggplot(dataset, aes(x = weight, y = ch2o, color = obesity_lev)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(title = "Scatterplot of Weight vs Water Consumption", x = "Weight", y = "Water Consumption (ch2o)") +
  theme_minimal() +
  coord_cartesian(xlim = c(35, 135))
The scatterplot visualizes the relationship between weight and water consumption (ch2o), categorized by obesity levels. The trend line reveals a slightly increasing pattern of water consumption as weight increases, though the relationship is relatively weak and mostly linear.
This pattern suggests that individuals with Insufficient Weight and Normal Weight categories generally report slightly lower water consumption compared to individuals in the higher weight categories, such as Obesity Type II and III. The increase in water consumption among higher weight groups could indicate attempts to adopt healthier habits or increased hydration needs due to larger body sizes. However, the relatively flat trend across most weight ranges suggests that water consumption does not vary dramatically across different weight categories, highlighting a potential area for targeted interventions to promote hydration as a component of healthy dietary behavior.
3.0.0.1.11 Technology utilization
Histogram with Density.
Code
ggplot(dataset, aes(x = use_tech)) +
  # after_stat(density) and linewidth avoid the ggplot2 3.4.0 deprecation warnings
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black", alpha = 0.6) +
  geom_density(color = "blue", linewidth = 1) +
  labs(title = "Histogram and Density of Use of Technology", x = "Use of Technology", y = "Density") +
  theme_minimal()
Density of Use of Technology by Obesity Level.
Code
ggplot(dataset, aes(x = use_tech, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Use of Technology by Obesity Level", x = "Use of Technology", y = "Density") +
  theme_minimal()
This density plot provides a perspective on the use of technology across different obesity levels. A striking feature is the sharp, dominant peak in Obesity Type III (yellow) around the value of 1. This pattern diverges notably from the smoother and more evenly distributed curves seen in other obesity categories, suggesting a unique behavioral trend in this group.
The peak indicates a strong clustering of individuals in Obesity Type III who report moderate use of technology, which may reflect consistent engagement with technology-based activities such as sedentary work, entertainment, or even health-monitoring applications. In contrast, other obesity categories, such as Obesity Type II and Overweight Level II, exhibit more balanced distributions without a single dominant peak, hinting at more varied technology usage patterns.
This observation raises interesting questions about the role of technology in shaping lifestyle behaviors in Obesity Type III individuals. It may point to a reliance on technology that correlates with a sedentary lifestyle, a known risk factor for obesity. Alternatively, it could reflect targeted interventions or habits specific to this group.
4 Analysis
The analysis phase is dedicated to the development, refinement, and comprehensive evaluation of the predictive models, meticulously designed to directly address the previously defined research questions.
4.1 Methods
The modeling process is structured to address the two key research questions:
identifying the most significant lifestyle and behavioral factors contributing to obesity in Mexico, Peru, and Colombia;
predicting whether a person will be obese based on some given combinations of factors.
4.1.1 Linear Regression Model
A linear regression model will be developed to predict an individual’s BMI using weight and height as predictors, reflecting their foundational role in BMI calculation. As emphasized by Mendoza Palechor and De La Hoz Manotas (2019), these variables are fundamental to understanding body composition and are directly tied to the dataset’s variable of obesity levels. By focusing on BMI as a continuous outcome, this approach complements categorical classifications by capturing more detailed variations in body composition across the population. The model will be evaluated using standard regression metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R², ensuring its predictive accuracy and reliability while providing a robust foundation for public health applications.
4.1.2 Logistic Regression Model
To address the key research questions, a logistic regression model will be employed to estimate the probability that an individual belongs to one of two categories: obese or not obese. Weight and height will be excluded as predictors because they are directly used to calculate BMI, which serves as the basis for the obesity levels categorized in the dataset. Including these variables would let the predictors determine the target almost by construction (a form of target leakage), which would bias the analysis and hide the contribution of other factors. By excluding weight and height, the focus shifts to behavioral and lifestyle factors, such as dietary habits, physical activity, and demographic characteristics, to better understand their influence on obesity risk.
While logistic regression provides a clear and interpretable framework for estimating probabilities, it inherently limits the analysis to a binary classification. This restriction prevents the exploration of the full spectrum of obesity levels, such as Obesity Type I, II, or III, as classified in the dataset. Despite this limitation, logistic regression is a robust method for quantifying the relationships between independent variables and the binary outcome. Feature selection techniques will ensure that only the most relevant predictors are retained, and the model’s performance will be rigorously evaluated using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC, ensuring reliable and actionable insights.
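The planned workflow can be sketched as follows. This is a minimal illustration, not the project's model: the data frame `toy`, the binary target `obese`, the two predictors, and the 0.5 decision threshold are assumptions made for the example, and the data are simulated.

```r
set.seed(123)
n <- 500

# Hypothetical toy data: predictor names follow the report, values are simulated.
toy <- data.frame(
  physical_act = runif(n, 0, 3),
  caloric_food = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.2, 0.8)))
)
log_odds  <- -0.5 - 0.6 * toy$physical_act + 1.2 * (toy$caloric_food == "yes")
toy$obese <- rbinom(n, size = 1, prob = plogis(log_odds))

# 80/20 split into training and hold-out data.
train_idx <- sample(seq_len(n), size = 0.8 * n)
train   <- toy[train_idx, ]
holdout <- toy[-train_idx, ]

# Logistic regression without weight or height as predictors.
model <- glm(obese ~ physical_act + caloric_food, data = train, family = binomial)

# Predicted probabilities and class labels at an assumed 0.5 threshold.
prob <- predict(model, newdata = holdout, type = "response")
pred <- as.integer(prob >= 0.5)

# Confusion-matrix counts.
tp <- sum(pred == 1 & holdout$obese == 1)
fp <- sum(pred == 1 & holdout$obese == 0)
fn <- sum(pred == 0 & holdout$obese == 1)
tn <- sum(pred == 0 & holdout$obese == 0)

accuracy  <- (tp + tn) / nrow(holdout)
precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
  NA_real_
} else {
  2 * precision * recall / (precision + recall)
}

# ROC-AUC via the rank-sum (Mann-Whitney) identity.
n1  <- sum(holdout$obese == 1)
n0  <- sum(holdout$obese == 0)
auc <- (sum(rank(prob)[holdout$obese == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1, auc = auc), 3))
```

Evaluating on a hold-out set rather than on the training data gives a less optimistic estimate of how the model would behave on new respondents.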
4.1.3 Insights and Limitations
Regression analysis helps us understand how predictors influence outcomes, with logistic regression classifying individuals as obese or not obese and linear regression predicting BMI as a continuous variable. The dataset offers a mix of advantages and challenges: synthetic data ensures balanced representation but lacks the complexity of real-world patterns, while user-collected data adds variability but is prone to biases. Logistic regression simplifies the analysis by focusing on binary outcomes, leaving out the nuanced gradations of obesity, and assumes linearity, which may not fully capture complex relationships. Linear regression relies on accurate weight and height data, making it sensitive to reporting errors. Despite these limitations, the models offer insights into obesity risk and body composition, serving as a valuable exercise and foundation for future projects, even if not directly applicable to real-world scenarios.
4.2 Goals for Each Method
4.2.1 Linear Regression Model Development
Data Loading and Processing
The dataset was imported, and initial exploration was conducted to understand its structure. BMI was calculated as a key variable, and missing values were addressed by removing incomplete rows. Boxplots were used to visualize the distributions of key variables, ensuring the dataset was ready for analysis.
  Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
1 Female  21   1.62   64.0                            yes   no    2   3
2 Female  21   1.52   56.0                            yes   no    3   3
3   Male  23   1.80   77.0                            yes   no    2   3
4   Male  27   1.80   87.0                             no   no    3   3
5   Male  22   1.78   89.8                             no   no    2   1
6   Male  29   1.62   53.0                             no  yes    2   3
       CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
1 Sometimes    no    2  no   0   1         no Public_Transportation
2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
4 Sometimes    no    2  no   2   0 Frequently               Walking
5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
6 Sometimes    no    2  no   0   0  Sometimes            Automobile
           NObeyesdad
1       Normal_Weight
2       Normal_Weight
3       Normal_Weight
4  Overweight_Level_I
5 Overweight_Level_II
6       Normal_Weight
Code
summary(dataset_raw)
    Gender               Age            Height          Weight
 Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00
 Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47
 Mode  :character   Median :22.78   Median :1.700   Median : 83.00
                    Mean   :24.31   Mean   :1.702   Mean   : 86.59
                    3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43
                    Max.   :61.00   Max.   :1.980   Max.   :173.00
 family_history_with_overweight     FAVC                FCVC
 Length:2111                    Length:2111        Min.   :1.000
 Class :character               Class :character   1st Qu.:2.000
 Mode  :character               Mode  :character   Median :2.386
                                                   Mean   :2.419
                                                   3rd Qu.:3.000
                                                   Max.   :3.000
      NCP            CAEC              SMOKE                CH2O
 Min.   :1.000   Length:2111        Length:2111        Min.   :1.000
 1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585
 Median :3.000   Mode  :character   Mode  :character   Median :2.000
 Mean   :2.686                                         Mean   :2.008
 3rd Qu.:3.000                                         3rd Qu.:2.477
 Max.   :4.000                                         Max.   :3.000
     SCC                 FAF              TUE             CALC
 Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111
 Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character
 Mode  :character   Median :1.0000   Median :0.6253   Mode  :character
                    Mean   :1.0103   Mean   :0.6579
                    3rd Qu.:1.6667   3rd Qu.:1.0000
                    Max.   :3.0000   Max.   :2.0000
    MTRANS           NObeyesdad
 Length:2111        Length:2111
 Class :character   Class :character
 Mode  :character   Mode  :character
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   24.33   28.72   29.70   36.02   50.81
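The code that created the BMI column is not displayed. A minimal sketch, assuming the standard definition of BMI as weight in kilograms divided by squared height in meters, is shown below using the six preview rows printed earlier (the data frame name `preview` is made up for the example).

```r
# Height (m) and weight (kg) of the first six rows, copied from the data preview.
preview <- data.frame(
  Height = c(1.62, 1.52, 1.80, 1.80, 1.78, 1.62),
  Weight = c(64.0, 56.0, 77.0, 87.0, 89.8, 53.0)
)

# BMI = weight (kg) / height (m)^2
preview$BMI <- preview$Weight / preview$Height^2
print(round(preview$BMI, 2))
```

Applied to the full data, the same expression would read `dataset_raw$BMI <- dataset_raw$Weight / dataset_raw$Height^2`, assuming the raw column names shown in the preview.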
Code
# Remove incomplete rows if any missing values are present
if (sum(is.na(dataset_raw)) > 0) {
  cat("Missing values detected! Removing rows with NA.\n")
  dataset_raw <- na.omit(dataset_raw)
}

boxplot(dataset_raw$Weight, main = "Weight Distribution", col = "grey", border = "black",
        notch = TRUE, horizontal = TRUE, xlab = "Weight (kg)", ylim = c(30, 200))
grid(nx = NULL, ny = NULL, lty = 0.5, col = "black")
Code
boxplot(dataset_raw$Height, main = "Height Distribution", col = "lightgreen", border = "darkgreen",
        notch = TRUE, ylab = "Height (m)", ylim = c(1.4, 2))
Linear Regression Model Development
A linear regression model was built to examine the relationship between BMI, weight, and height. The model summary provided key performance metrics and insights into variable contributions.
Code
linear_model <- lm(BMI ~ Weight + Height, data = dataset_raw)
summary(linear_model)
Call:
lm(formula = BMI ~ Weight + Height, data = dataset_raw)

Residuals:
    Min      1Q  Median      3Q     Max
-4.0892 -0.3809  0.1300  0.4007  2.4948

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.626e+01  3.455e-01   162.8   <2e-16 ***
Weight       3.403e-01  7.767e-04   438.1   <2e-16 ***
Height      -3.292e+01  2.180e-01  -151.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8282 on 2108 degrees of freedom
Multiple R-squared:  0.9893,    Adjusted R-squared:  0.9893
F-statistic: 9.767e+04 on 2 and 2108 DF,  p-value: < 2.2e-16
Evaluation
The model was evaluated by generating predictions and calculating key performance metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²). These metrics assess the model’s accuracy and its ability to explain the variance in BMI.
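The metric calculations themselves are not displayed, so the sketch below shows one common way to obtain them. It runs on simulated data: the data frame `toy` and the object `fit` are stand-ins for `dataset_raw` and `linear_model`, and the value ranges are taken from the summary table only to make the example plausible.

```r
set.seed(7)

# Hypothetical toy data within the height and weight ranges of the dataset summary.
toy <- data.frame(
  Height = runif(200, 1.45, 1.98),
  Weight = runif(200, 39, 173)
)
toy$BMI <- toy$Weight / toy$Height^2

fit  <- lm(BMI ~ Weight + Height, data = toy)
pred <- predict(fit, newdata = toy)   # in-sample predictions

# Standard regression metrics.
mse  <- mean((toy$BMI - pred)^2)
rmse <- sqrt(mse)
r2   <- 1 - sum((toy$BMI - pred)^2) / sum((toy$BMI - mean(toy$BMI))^2)

print(round(c(MSE = mse, RMSE = rmse, R2 = r2), 4))
```

Because these metrics are computed on the same rows used for fitting, they describe goodness of fit rather than out-of-sample accuracy; a train/test split would be needed for the latter.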
A scatter plot was created to compare actual BMI values with predicted values. A reference line (ideal fit) was added to assess the alignment of predictions with true values. The plot provides a visual representation of the model’s accuracy.
Code
# Fitted BMI values from the model above (in-sample predictions)
predictions <- predict(linear_model)

plot(dataset_raw$BMI, predictions, xlab = "Actual BMI", ylab = "Predicted BMI",
     main = "Comparison between actual and predicted BMI", pch = 16, col = "blue", cex = 0.6)
abline(0, 1, col = "red", lwd = 2)  # ideal-fit reference line (y = x)
legend("topleft", legend = c("Predicted Values", "Ideal Fit (y = x)"), col = c("blue", "red"),
       pch = c(16, NA), lty = c(NA, 1), bty = "n")
Diagnostics
Diagnostic plots were generated to evaluate the linear regression model’s assumptions, including residual patterns, normality, and variance consistency. A histogram of residuals was also created to assess their distribution, with a vertical reference line highlighting the zero-residual point.
hist(residuals(linear_model), col = "Gray", border = "white", main = "Residuals Distribution",
     xlab = "Residuals", ylab = "Frequency", breaks = 15,
     cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2)
abline(v = 0, col = "red", lwd = 2, lty = 2)
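Only the residual histogram is displayed above. The remaining diagnostic plots could be produced with the built-in `plot()` method for `lm` objects, as in the sketch below. It again uses simulated data (`toy` and `fit` are stand-ins for `dataset_raw` and `linear_model`), and the Shapiro-Wilk test is an optional addition that the report does not mention.

```r
set.seed(11)

# Hypothetical toy data standing in for the real dataset.
toy <- data.frame(
  Height = runif(200, 1.45, 1.98),
  Weight = runif(200, 39, 173)
)
toy$BMI <- toy$Weight / toy$Height^2
fit <- lm(BMI ~ Weight + Height, data = toy)

pdf(file = NULL)                 # null device so the sketch also runs without a display;
                                 # drop this line (and dev.off()) for interactive use
old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid for the four default diagnostic plots
plot(fit)                        # residuals vs fitted, normal Q-Q, scale-location,
                                 # residuals vs leverage
par(old_par)
dev.off()

# Optional numeric check of residual normality (an assumption of this sketch).
sw <- shapiro.test(residuals(fit))
print(sw$p.value)
```

A curved pattern in the residuals-vs-fitted panel would be expected here, since BMI is a nonlinear function of height that a straight-line term can only approximate.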
4.2.2 Logistic Regression Model Development
4.3 Results
4.3.1 Conclusion
So far, we have conducted a comprehensive exploration and preparation of our dataset, focusing on understanding the influence of lifestyle factors on obesity within a sample from Mexico, Peru, and Colombia. The dataset, which was pre-processed with SMOTE to address class imbalance, has provided us with balanced obesity categories, facilitating an in-depth analysis of key variables such as eating habits, physical activity, and alcohol consumption. Through correlation analysis, we identified the variables with the strongest associations to obesity levels, helping to guide our selection of factors for inclusion in the next modeling phase. Additionally, we have thoroughly cleaned and structured the data, renaming variables for clarity, formatting categorical variables, and removing duplicates to ensure a solid foundation for robust modeling.
The next steps involve constructing regression models to analyze the relationships and predictive power of these selected factors on obesity levels. Specifically, we will develop two versions of the model, one that includes extreme values and one that excludes them, to evaluate the impact of outliers on model accuracy and stability. Key diagnostics such as R², p-values, and the variance inflation factor (VIF) will be used to confirm the reliability of the model and address potential multicollinearity issues. Following this, we will build and fine-tune a predictive model using metrics like Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R² to validate and enhance performance.
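As a sketch of the multicollinearity check, the VIF of a predictor can be computed by regressing it on the other predictors and taking 1 / (1 − R²). The helper `vif_manual()` and the simulated predictors `x1`, `x2`, `x3` are hypothetical names for this example; in practice a packaged function such as `car::vif()` is commonly used instead.

```r
set.seed(3)
n <- 300

# Hypothetical predictors: x2 is built to be strongly correlated with x1.
toy <- data.frame(x1 = rnorm(n))
toy$x2 <- 0.9 * toy$x1 + rnorm(n, sd = 0.3)
toy$x3 <- rnorm(n)                         # unrelated to the other two

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all remaining predictors.
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy)
print(round(vifs, 2))
```

Values close to 1 indicate little overlap with the other predictors, while values well above common rules of thumb (often 5 or 10) suggest that a predictor is largely explained by the others.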
These efforts will culminate in a final report that, while primarily an exercise and not applicable in real-world contexts, highlights our findings and offers insights into the most influential lifestyle factors affecting obesity. This analysis aims to provide actionable recommendations within a simulated scenario, illustrating how data-driven insights could support public health strategies focused on obesity reduction.
4.4 Next Steps
Outline the next steps planned for completing the project, such as refining analyses, adding new methods, or addressing outstanding data issues.
4.5 Final Thoughts
Briefly reflect on any challenges or limitations encountered so far and how these might be addressed in the final report.
Source Code
---
title: "Project Update Report (Group G): Code and Structure"
author:
  - Alessandro Pizzi
  - Andrea Lovato
  - Ayman El Abed
  - Illia Dorofieiev
institute: University of Lausanne
date: today
title-block-banner: "#0095C8" # chosen for the university of lausanne
toc: true
toc-location: right
format:
  html:
    number-sections: true
    html-math-method: katex
    self-contained: true
    code-overflow: wrap
    code-fold: true
    code-tools: true
    include-in-header: # add custom css to make the text in the `</> Code` dropdown black
      text: |
        <style type="text/css">
        .quarto-title-banner a {
          color: #000000;
        }
        </style>
  pdf: # use this if you want to render pdfs instead
    include-in-header: # wrapping the code also in the pdf (otherwise, it overflows)
      text: |
        \usepackage{fvextra}
        \DefineVerbatimEnvironment{Highlighting}{Verbatim}{
          commandchars=\\\{\},
          breaklines, breaknonspaceingroup, breakanywhere
        }
abstract: |
  This is the abstract of the report. It should be a short summary of the project, the data, the analysis and the results. It should be concise and to the point. It should not be longer than 250 words.
---

```{r}
#| label: setup
#| echo: false
#| message: false

# loading all the necessary packages
source(here::here("src", "setup.R"))
```

::: {.callout-tip}
### How to include sections separately

- You can use `{include X}` to include different sections of your report as separate `.qmd` files. This is also well documented in the Quarto documentation: <https://quarto.org/docs/authoring/includes>
- As mentioned in the documentation, we have used (_) prefix for the included files (e.g., `_introduction.qmd` and `_data.qmd`). You should always use an underscore prefix with included files so that they are automatically ignored (i.e. not treated as standalone files) by a quarto render of a project (not absolutely necessary in your case, but highly recommended).
- Rendering only `report.qmd` will render also all the other files.
:::

{{< include sections/_introduction.qmd >}}

{{< include sections/_data.qmd >}}

{{< include sections/_eda.qmd >}}

{{< include sections/_analysis.qmd >}}

{{< include sections/_conclusion.qmd >}}